2015-11-24

Warmups - Problem 1

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0

What do you want to know?

Warmups - Problem 2

What's in the column names of this data?

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.182424 2.2042219 4.195636 2.6273345 5.063641 4.540002 5.534677 4.411520 3.846151 4.175877 3.7255782
Gene 2 1.464224 0.5854472 1.859238 0.5152242 2.882808 1.364037 2.960069 1.309185 1.182724 1.187236 0.8897749
Gene 3 2.031792 0.8701078 3.281983 0.5330452 4.627315 2.182192 5.556209 2.539065 2.539960 3.041293 1.3507193

Warmups - Problem 3

How many ways can you write today's date?

What we are going to cover today

  • Reading different data formats
  • Tidying data
  • Split - apply - combine
  • Pipes
  • Joins
  • Working with dates
  • Splitting strings

Reading different data formats

  • images: library(EBImage)
  • sound: library(tuneR)
  • fixed width fields: read.fwf()
  • netCDF: library(ncdf)
  • hdf5: library(hdf5)
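A minimal read.fwf() sketch, reading two made-up fixed-width records from a text connection rather than a file (the field widths and values here are illustrative):

```r
# Fields: 4-char id, 3-char year, 6-char value (widths are illustrative)
raw <- textConnection("A001 15  3.20
B002 16 11.45")
df <- read.fwf(raw, widths = c(4, 3, 6),
               col.names = c("id", "year", "value"),
               strip.white = TRUE)
df
```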

XML/HTML

library(XML)
src <- "http://www.realclearpolitics.com/epolls/2012/president/us/republican_presidential_nomination-1452.html"
tables <- readHTMLTable(src)
polls <- tables[[1]]
head(polls)
#>                   Poll        Date Sample MoE Romney  Santorum  Gingrich 
#> 1          RCP Average  4/9 - 4/17     --  --    52.8                19.0
#> 2    CBS News/NY Times 4/13 - 4/17 268 RV 6.0      54        --        20
#> 3 CNN/Opinion Research 4/13 - 4/15  473 A 4.5      57        --        19
#> 4              PPP (D) 4/12 - 4/15 742 RV 3.6      54        --        24
#> 5             FOX News  4/9 - 4/11 376 RV 5.0      46        15        13
#>   Paul  Perry  Huntsman  Bachmann  Cain        Spread
#> 1  15.0                                  Romney +33.8
#> 2    12     --        --        --    --   Romney +34
#> 3    18     --        --        --    --   Romney +38
#> 4    14     --        --        --    --   Romney +30
#> 5    16     --        --        --    --   Romney +30

See also the scrapeR and rvest packages

GIS

This code is a bit slow to run, but it draws all the electoral districts of Australia.

library(maptools)
library(ggplot2)
xx <- readShapeSpatial(
  "http://dicook.github.io/Monash-R/data/australia/region.shp")
object.size(as(xx, "SpatialPolygons"))
# Thin the polygon boundaries to shrink the object
xxx <- thinnedSpatialPoly(as(xx, "SpatialPolygons"), 
  tolerance=0.5, minarea=0.001, topologyPreserve=TRUE)
object.size(as(xxx, "SpatialPolygons"))
qplot(long, lat, data=xx, group=group) + geom_path() + coord_map() 

Using the packages tidyr, dplyr

library(tidyr)
library(dplyr)
data(french_fries, package = "reshape2")
knitr::kable(head(french_fries), format = "markdown", row.names = FALSE)

French fries - hot chips

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0
1 1 15 1 1.2 0.1 0.0 1.1 5.1
1 1 15 2 8.8 3.0 3.6 1.5 2.3

What would we like to know?

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Each of these questions involves different summaries of the data.

What we have and what we want

Gathering

  • When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers:

  • Identify a record (must be unique)
  • Example: indices on a random variable
  • Fixed by design of experiment (known in advance)
  • May be single or composite (one or more variables)

Values/Measures:

  • Collected during the experiment (not known in advance)
  • Usually numeric quantities

Gathering the French Fries

library(tidyr)
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps for display in a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, value = rating)

head(french_fries_wide)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         1      10   1   11.0     6.4    0.0    0.0    0.0
#> 4    1         1      10   2    9.9     5.9    2.9    2.2    0.0
#> 5    1         1      15   1    1.2     0.1    0.0    1.1    5.1
#> 6    1         1      15   2    8.8     3.0    3.6    1.5    2.3

Let's use gather and spread to answer some questions

The easiest question to start with is whether the ratings are similar on the different scales: potato-y, buttery, grassy, rancid and painty.

We need to gather the data into long form, and make plots faceted by the scale.

Ratings on the different scales

library(ggplot2)
ff.m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
qplot(rating, data=ff.m, binwidth=2) + 
  facet_wrap(~type, ncol=5) 

Side-by-side boxplots

qplot(type, rating, data = ff.m, fill = type, geom = "boxplot")

Do the replicates look like each other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check replicates

head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
ff.s <- ff.m %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject    type    1    2
#> 1    1         1       3  potato  2.9 14.0
#> 2    1         1       3 buttery  0.0  0.0
#> 3    1         1       3  grassy  0.0  0.0
#> 4    1         1       3  rancid  0.0  1.1
#> 5    1         1       3  painty  5.5  0.0
#> 6    1         1      10  potato 11.0  9.9

Check replicates

qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2")
qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2") + 
  scale_x_log10() + scale_y_log10()

Your turn

Make scatterplots of the reps against each other separately for each scale, and for each treatment.

Legos = tidy data

(Courtesy of Hadley Wickham)

Playmobil = messy data

(Courtesy of Hadley Wickham)

Your turn

Read in the Billboard Top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the Billboard charts in the year 2000:

billboard <- read.csv("http://dicook.github.io/Monash-R/data/billboard.csv")

What's in this data? What's X1-X76?

Your turn

  1. Use tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
  2. Use ggplot2 to create this time series plot:

The Split-Apply-Combine Approach

Split-Apply-Combine in dplyr

library(dplyr)
french_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split, rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> Source: local data frame [5 x 2]
#> 
#>   variable    rating
#>     (fctr)     (dbl)
#> 1   potato 6.9525180
#> 2  buttery 1.8236994
#> 3   grassy 0.6641727
#> 4   rancid 3.8522302
#> 5   painty 2.5217579

The pipe operator

  • dplyr allows us to chain together these data analysis tasks using the %>% (pipe) operator
  • x %>% f(y) is shorthand for f(x, y)
  • Example:
french_fries %>%
    gather(key = variable, value = rating, potato:painty) %>%
    group_by(variable) %>%
    summarise(rating = mean(rating, na.rm = TRUE))
#> Source: local data frame [5 x 2]
#> 
#>   variable    rating
#>     (fctr)     (dbl)
#> 1   potato 6.9525180
#> 2  buttery 1.8236994
#> 3   grassy 0.6641727
#> 4   rancid 3.8522302
#> 5   painty 2.5217579

dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

  • Filter: Pick out the rows of a data frame that match a condition, producing subsets
  • Arrange: Reorder the rows of a data frame
  • Select: Select particular columns of a data frame
  • Mutate: Add new columns that are functions of existing columns
  • Summarise: Create collapsed summaries of a data frame

Filter

french_fries %>%
    filter(subject == 3, time == 1)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
#> 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
#> 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
#> 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Arrange

french_fries %>%
    arrange(desc(rancid)) %>%
    head
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    9         2      51   1    7.3     2.3      0   14.9    0.1
#> 2   10         1      86   2    0.7     0.0      0   14.3   13.1
#> 3    5         2      63   1    4.4     0.0      0   13.8    0.6
#> 4    9         2      63   1    1.8     0.0      0   13.7   12.3
#> 5    5         2      19   2    5.5     4.7      0   13.4    4.6
#> 6    4         3      63   1    5.6     0.0      0   13.3    4.4

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
#>    time treatment subject rep potato
#> 61    1         1       3   1    2.9
#> 25    1         1       3   2   14.0
#> 62    1         1      10   1   11.0
#> 26    1         1      10   2    9.9
#> 63    1         1      15   1    1.2
#> 27    1         1      15   2    8.8

Summarise

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#> 
#>      time treatment mean_rancid sd_rancid
#>    (fctr)    (fctr)       (dbl)     (dbl)
#> 1       1         1    2.758333  3.212870
#> 2       1         2    1.716667  2.714801
#> 3       1         3    2.600000  3.202037
#> 4       2         1    3.900000  4.374730
#> 5       2         2    2.141667  3.117540
#> 6       2         3    2.495833  3.378767
#> 7       3         1    4.650000  3.933358
#> 8       3         2    2.895833  3.773532
#> 9       3         3    3.600000  3.592867
#> 10      4         1    2.079167  2.394737
#> ..    ...       ...         ...       ...

Let's use these tools to answer the rest of the french fries questions

If the data is complete it should be 12 x 10 x 3 x 2 = 720 records, that is, 6 records (3 treatments x 2 reps) for each person at each time. (Assuming that each person rated on all scales.)

To check this we want to tabulate the number of records for each subject, time and treatment. This means selecting the appropriate columns, counting the records, and spreading the counts out to give a nice table.

Check completeness

french_fries %>% 
  select(subject, time, treatment) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> 
#>    subject     1     2     3     4     5     6     7     8     9    10
#>     (fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
#> 1        3     6     6     6     6     6     6     6     6     6    NA
#> 2       10     6     6     6     6     6     6     6     6     6     6
#> 3       15     6     6     6     6     6     6     6     6     6     6
#> 4       16     6     6     6     6     6     6     6     6     6     6
#> 5       19     6     6     6     6     6     6     6     6     6     6
#> 6       31     6     6     6     6     6     6     6     6    NA     6
#> 7       51     6     6     6     6     6     6     6     6     6     6
#> 8       52     6     6     6     6     6     6     6     6     6     6
#> 9       63     6     6     6     6     6     6     6     6     6     6
#> 10      78     6     6     6     6     6     6     6     6     6     6
#> 11      79     6     6     6     6     6     6     6     6     6    NA
#> 12      86     6     6     6     6     6     6     6     6    NA     6

Check completeness with different scales, too

french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep) %>%
  select(subject, time, treatment, type) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> 
#>    subject     1     2     3     4     5     6     7     8     9    10
#>     (fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
#> 1        3    30    30    30    30    30    30    30    30    30    NA
#> 2       10    30    30    30    30    30    30    30    30    30    30
#> 3       15    30    30    30    30    30    30    30    30    30    30
#> 4       16    30    30    30    30    30    30    30    30    30    30
#> 5       19    30    30    30    30    30    30    30    30    30    30
#> 6       31    30    30    30    30    30    30    30    30    NA    30
#> 7       51    30    30    30    30    30    30    30    30    30    30
#> 8       52    30    30    30    30    30    30    30    30    30    30
#> 9       63    30    30    30    30    30    30    30    30    30    30
#> 10      78    30    30    30    30    30    30    30    30    30    30
#> 11      79    30    30    30    30    30    30    30    30    30    NA
#> 12      86    30    30    30    30    30    30    30    30    NA    30

Change in ratings over weeks, relative to experimental design

qplot(time, rating, data=ff.m, colour=treatment) + 
  facet_grid(subject~type) 

Add means over reps, and connect the dots

ff.m.av <- ff.m %>% 
  group_by(subject, time, type, treatment) %>%
  summarise(rating=mean(rating))
qplot(time, rating, data=ff.m, colour=treatment) + 
  facet_grid(subject~type) +
  geom_line(data=ff.m.av, aes(group=treatment))

Your turn

Read in the flights data:

library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#> 
#>     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
#>    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
#> 1   2013     1     1      517         2      830        11      UA  N14228
#> 2   2013     1     1      533         4      850        20      UA  N24211
#> 3   2013     1     1      542         2      923        33      AA  N619AA
#> 4   2013     1     1      544        -1     1004       -18      B6  N804JB
#> 5   2013     1     1      554        -6      812       -25      DL  N668DN
#> 6   2013     1     1      554        -4      740        12      UA  N39463
#> 7   2013     1     1      555        -5      913        19      B6  N516JB
#> 8   2013     1     1      557        -3      709       -14      EV  N829AS
#> 9   2013     1     1      557        -3      838        -8      B6  N593JB
#> 10  2013     1     1      558        -2      753         8      AA  N3ALAA
#> ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

Your turn

This dataset contains information on over 300,000 flights that departed from New York City in the year 2013.

  1. Using dplyr and the pipe operator, create a data frame consisting of the average arrival delay (arr_delay) based on the destination airport (dest). Sort this data frame in descending order, so the destination airport with the largest delay is first.
  2. Find out the most used airports for each airline carrier.

Dates and Times

Dates are deceptively hard to work with in R.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

  • Time zones
  • POSIXct format in base R is challenging

The lubridate package helps tackle some of these issues.
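For the ambiguous date above, lubridate makes the intended order explicit: the same string parses differently depending on which function you choose.

```r
library(lubridate)

mdy("02/05/2012")  # month-day-year: February 5th
dmy("02/05/2012")  # day-month-year: May 2nd
```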

Basic Lubridate Use

library(lubridate)

now()
today()
now() + hours(4)
today() - days(2)
#> [1] "2015-11-24 21:59:46 AEDT"
#> [1] "2015-11-24"
#> [1] "2015-11-25 01:59:46 AEDT"
#> [1] "2015-11-22"

Parsing Dates

ymd("2013-05-14")
mdy("05/14/2013")
dmy("14052013")
ymd_hms("2013:05:14 14:50:30", tz = "America/Chicago")
#> [1] "2013-05-14 UTC"
#> [1] "2013-05-14 UTC"
#> [1] "2013-05-14 UTC"
#> [1] "2013-05-14 14:50:30 CDT"
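Time zones are easier to handle too: with_tz() re-expresses the same instant in another zone.

```r
library(lubridate)

t <- ymd_hms("2013-05-14 14:50:30", tz = "America/Chicago")
with_tz(t, "UTC")                  # the same instant, shown in UTC
with_tz(t, "Australia/Melbourne")  # ... and in Melbourne time
```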

Your turn

  1. Using the flights data, create a new column Date using lubridate. You will need to paste together the columns year, month, and day in order to do this. See the paste function.
  2. Use dplyr to calculate the average departure delay for each date.
  3. Plot a time series of the date versus the average departure delay.

String manipulation

When the experimental design is packed into the column name, we need to be able to unpack it.

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.182424 2.2042219 4.195636 2.6273345 5.063641 4.540002 5.534677 4.411520 3.846151 4.175877 3.7255782
Gene 2 1.464224 0.5854472 1.859238 0.5152242 2.882808 1.364037 2.960069 1.309185 1.182724 1.187236 0.8897749
Gene 3 2.031792 0.8701078 3.281983 0.5330452 4.627315 2.182192 5.556209 2.539065 2.539960 3.041293 1.3507193

Gather column names into long form

id treatment expr
Gene 1 WI-6.R1 2.1824242
Gene 2 WI-6.R1 1.4642236
Gene 3 WI-6.R1 2.0317925
Gene 1 WI-6.R2 2.2042219
Gene 2 WI-6.R2 0.5854472
Gene 3 WI-6.R2 0.8701078
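The long form above comes from gathering the wide table. A minimal sketch, using a small stand-in data frame (the name `genes` is an assumption; only the first two columns of the table are reproduced):

```r
library(tidyr)

# A small stand-in for the wide gene-expression table above
genes <- data.frame(
  id = c("Gene 1", "Gene 2"),
  `WI-6.R1` = c(2.182424, 1.464224),
  `WI-6.R2` = c(2.204222, 0.585447),
  check.names = FALSE
)

# Gather every column except id into treatment/expr pairs
genes.m <- gather(genes, treatment, expr, -id)
genes.m
```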

Split strings

library(stringr)
genes.m$treatment <- as.character(genes.m$treatment)
# Split "WI-6.R1" at the "." into experiment ("WI-6") and replicate ("R1")
exp.trts <- str_split(genes.m$treatment, "\\.")
# unlist() interleaves the pieces, so take every second element
genes.m$rep <- unlist(exp.trts)[seq(2, length(exp.trts)*2, 2)]
exp.trts <- unlist(exp.trts)[seq(1, length(exp.trts)*2, 2)]
# Split "WI-6" at the "-" into genotype/treatment ("WI") and time ("6")
exp.trts <- str_split(exp.trts, "-")
genes.m$geno <- unlist(exp.trts)[seq(1, length(exp.trts)*2, 2)]
genes.m$time <- unlist(exp.trts)[seq(2, length(exp.trts)*2, 2)]
# First character is the genotype, second the treatment
genes.m$genotype <- str_sub(genes.m$geno, 1, 1)
genes.m$trt <- str_sub(genes.m$geno, 2, 2)
# "R1" -> 1
genes.m$rep <- extract_numeric(genes.m$rep)
head(genes.m)
#> Source: local data frame [6 x 8]
#> 
#>       id treatment      expr   rep  geno  time genotype   trt
#>    (chr)     (chr)     (dbl) (dbl) (chr) (chr)    (chr) (chr)
#> 1 Gene 1   WI-6.R1 2.1824242     1    WI     6        W     I
#> 2 Gene 2   WI-6.R1 1.4642236     1    WI     6        W     I
#> 3 Gene 3   WI-6.R1 2.0317925     1    WI     6        W     I
#> 4 Gene 1   WI-6.R2 2.2042219     2    WI     6        W     I
#> 5 Gene 2   WI-6.R2 0.5854472     2    WI     6        W     I
#> 6 Gene 3   WI-6.R2 0.8701078     2    WI     6        W     I
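As an alternative sketch, tidyr's separate() can unpack the same structure with less bookkeeping (the column names chosen here are illustrative, and a tiny stand-in for genes.m is built inline):

```r
library(tidyr)

# A couple of rows in the long form above (illustrative subset)
genes.m <- data.frame(
  id = c("Gene 1", "Gene 2"),
  treatment = c("WI-6.R1", "WM-12.R2"),
  expr = c(2.182424, 1.187236)
)

# "WI-6.R1" -> geno_time "WI-6", rep "R1"; then "WI-6" -> geno "WI", time "6"
genes.s <- separate(genes.m, treatment, into = c("geno_time", "rep"),
                    sep = "\\.", remove = FALSE)
genes.s <- separate(genes.s, geno_time, into = c("geno", "time"), sep = "-")
genes.s
```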

Now we can look at it in various ways

genes.m.av <- genes.m %>% 
  group_by(id, trt, time) %>% 
  summarise(expr=mean(expr))
qplot(trt, expr, data=genes.m, colour=time) + 
  xlab("Type of modification") + ylab("Expression") + 
  facet_wrap(~id) +
  geom_line(data=genes.m.av, aes(group=time))

TODO

  • SQL/JOINS with dplyr and flight data? (Carson)
  • dplyr do() + tidyr::unnest() (Carson)
  • reading different data formats: image, json, audio, xml/html, ncdf, excel/sas/stata, GIS files, (Di)